ETC5521: Diving Deeply into Data Exploration

Going beyond two variables, exploring high dimensions

Professor Di Cook

Department of Econometrics and Business Statistics

Outline

  • What is high-dimensional data? (If all variables are quantitative)
  • Exploring relationships between more than two variables
    • Tours - scatterplots of combinations of variables
    • Matrix of plots
    • Parallel coordinates
  • What can be hidden
  • Automating the search for pairwise relationships using scagnostics
  • Linking elements of multiple plots
  • Exploring multiple categorical variables

Flatland


Trailer for “FLATLAND 2: SPHERELAND”. The original book and movie information are on Wikipedia.

High-dimensional shapes: shadows and slices

Low-dimensional shapes in high dimensions

What is high-dimensional space?

When all variables are quantitative, each extra variable adds an extra orthogonal axis. The resulting space is called Euclidean space, a concept that dates back to the ancient Greeks.

Features to find

Feature         Description
linear form     The shape is linear
nonlinear form  The shape is more of a curve
outliers        There are one or more points that do not fit the pattern of the others
clusters        The observations group into multiple clumps
gaps            There is a gap, or gaps, but it's not clumped
barrier         There is a combination of the variables which appears impossible
l-shape         When one variable changes, the other is approximately constant
discreteness    Relationship between two variables differs from the overall pattern, and observations are in a striped pattern

Any of the features from 2D are patterns to find in higher dimensions.

A movie of linear combinations: tour

Grand tour

Code
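library(palmerpenguins)
library(dplyr)   # select(), rename(), mutate()
library(tourr)   # animate_xy(), render_gif(), grand_tour()

# Standardise a variable to mean 0 and standard deviation 1
f_std <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Keep the four size measurements and species, shorten names, standardise
p_std <- penguins |>
  select(bill_length_mm:body_mass_g, species) |>
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g) |>
  na.omit() |>
  mutate(bl = f_std(bl),
         bd = f_std(bd),
         fl = f_std(fl),
         bm = f_std(bm))

# Run the grand tour interactively on the four numeric variables
animate_xy(p_std[,1:4], axes="off")

# Save 50 frames of the tour as a GIF (needs the gifski package)
render_gif(p_std[,1:4], grand_tour(), display_xy(axes="off"),
           gif_file = "images/penguins_grand.gif",
           frames = 50,
           start = basis_random(4,2))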

How many clusters?

Code
# Same grand tour, with points coloured by species
# (uses p_std from the previous code chunk)
animate_xy(p_std[,1:4], axes="off", col=p_std$species)
render_gif(p_std[,1:4], grand_tour(),
           display_xy(col=p_std$species, axes="off"),
           gif_file = "images/penguins_grand_sp.gif",
           frames = 50,
           start = basis_random(4,2))

The clusters correspond to the three species.

What does linear combination of variables mean?

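Each frame of a tour is a 2D projection of the data, and each projected coordinate is a linear combination of the original variables. The sketch below illustrates this in base R. The data are simulated stand-ins (only the column names mirror the penguins example), and the random basis is built here with a QR decomposition as an assumed stand-in for what a function such as basis_random(4, 2) provides.

```r
# Stand-in data: 100 observations on 4 standardised variables.
# The column names mirror the penguins example; the values are simulated.
set.seed(5521)
X <- matrix(rnorm(100 * 4), ncol = 4,
            dimnames = list(NULL, c("bl", "bd", "fl", "bm")))

# A random 4 x 2 orthonormal basis: orthonormalise a random matrix with QR.
A <- qr.Q(qr(matrix(rnorm(4 * 2), ncol = 2)))
rownames(A) <- colnames(X)

# Projecting the data: each new coordinate is a linear combination,
# P1 = A[1,1]*bl + A[2,1]*bd + A[3,1]*fl + A[4,1]*bm, and similarly for P2.
proj <- X %*% A
colnames(proj) <- c("P1", "P2")

# The same value computed by hand for the first observation.
p1_by_hand <- sum(X[1, ] * A[, 1])
```

Calling plot(proj) would draw one such projection. A tour shows a smooth sequence of these as the basis A changes.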

Guided tour

Manual tour

Slice tour

Scale your data!
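A small sketch of why scaling matters: when variables are recorded in different units, the one with the largest numeric spread dominates any linear combination. The data below are simulated stand-ins with assumed means and spreads, not the penguins measurements. Standardising with scale() does the same job as the f_std() helper in the grand tour code.

```r
set.seed(5521)
# Simulated stand-in: two measurements in very different units.
d <- data.frame(fl = rnorm(200, mean = 200, sd = 14),    # a length, in mm
                bm = rnorm(200, mean = 4200, sd = 800))  # a mass, in g

# Before scaling: bm varies on a much larger numeric scale than fl.
sd_raw <- sapply(d, sd)

# Standardise each column to mean 0 and standard deviation 1.
d_std <- as.data.frame(scale(d))
sd_std <- sapply(d_std, sd)
```

After standardising, both columns contribute on an equal footing to every projection.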

Static plots of multivariate data

Simpler: scatterplot matrix

Parallel coordinate plot

What you might miss without a tour

Hidden structure

Famous example: RANDU
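The structure in RANDU can be checked numerically with the randu data that ships with R (400 triples of successive values in columns x, y, z). Because of the generator's recurrence, the combination 9x - 6y + z is always an integer, so the points fall on a small number of parallel planes. The sketch below assumes only base R. The tolerance allows for the rounding of the stored values.

```r
# randu is in the datasets package, loaded by default.
data(randu)

# The linear combination that exposes the planes.
v <- with(randu, 9 * x - 6 * y + z)

# Distance of each value from the nearest integer: tiny, up to rounding.
max_dev <- max(abs(v - round(v)))

# Which plane each triple lies on, and how many triples per plane.
planes <- table(round(v))
```

No pairwise scatterplot of x, y and z shows this pattern. It only appears in the right linear combination, which is what a tour can find.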

Automating the search with scagnostics

Linking elements of multiple plots

Exploring multiple categorical variables

Resources